Abstract

Probabilistic k-nearest neighbour (PKNN) classification was introduced to improve on the original k-nearest neighbour (KNN) classification algorithm by explicitly modelling uncertainty in the classification of each feature vector. However, an issue common to both KNN and PKNN is the selection of the optimal number of neighbours, k. The contribution of this paper is to incorporate the uncertainty in k into the decision making, and in so doing use Bayesian model averaging to provide improved classification. Indeed, the problem of assessing the uncertainty in k can be viewed as one of statistical model selection, which is one of the most important technical issues in the statistics and machine learning domain. In this paper, a new functional approximation algorithm is proposed to reconstruct the density of the model (order) without relying on time-consuming Monte Carlo simulations. In addition, this algorithm avoids cross validation by adopting a Bayesian framework. The algorithm yielded very good performance on several real experimental datasets.

Keywords: Bayesian Inference, Model Averaging, K-free model order 
estimation 



Preprint submitted to Pattern Recognition, May 7, 2013

1. Introduction

Supervised classification is a very well studied problem in the machine learning and statistics literature, where the k-nearest neighbour algorithm (KNN) is one of the most popular approaches. It amounts to assigning to an unlabelled feature vector the most common class label among its k neighbouring feature vectors. One of the key issues in implementing this algorithm is choosing the number of neighbours k, and various flavours of cross validation are used for this purpose. However, a drawback of KNN is that it does not have a probabilistic interpretation; for example, no uncertainty is associated with the inferred class label.

There have been several recent papers which addressed this deficiency [1, 2, 3, 4]. Indeed, from such a Bayesian perspective the issue of choosing the value of k can be viewed as a model (order) selection problem. To date, there exist several different approaches to tackle the model selection problem. One of the most popular is based on information criteria, including the Akaike Information Criterion (AIC), Schwarz's Bayesian Information Criterion (BIC) and the Deviance Information Criterion (DIC) [5, 6, 7]. Given a particular model $M_k$, the well-known AIC and BIC are defined by $\mathrm{AIC}(M_k) = -2 \log L(M_k) + 2 e(M_k)$ and $\mathrm{BIC}(M_k) = -2 \log L(M_k) + e(M_k) \log N$ for $N$ observations, where $L(M_k)$ and $e(M_k)$ denote the likelihood and the number of parameters of $M_k$, respectively.
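As a small illustration of these two criteria, the following Python sketch scores three hypothetical candidate models. The log-likelihoods, parameter counts and sample size are made-up values chosen only for the example, not results from this paper.

```python
import math

def aic(log_lik, n_params):
    """AIC(M_k) = -2 log L(M_k) + 2 e(M_k)."""
    return -2.0 * log_lik + 2.0 * n_params

def bic(log_lik, n_params, n_obs):
    """BIC(M_k) = -2 log L(M_k) + e(M_k) log N."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

# Hypothetical candidates: name -> (log-likelihood, number of parameters).
candidates = {"M_1": (-120.0, 2), "M_2": (-115.0, 4), "M_3": (-114.5, 7)}
n_obs = 100  # assumed number of observations N

aic_scores = {name: aic(ll, e) for name, (ll, e) in candidates.items()}
bic_scores = {name: bic(ll, e, n_obs) for name, (ll, e) in candidates.items()}

# The preferred model minimises the criterion.
best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)
print(best_aic, best_bic)
```

Both criteria return a single preferred model rather than a distribution over model orders, which is the limitation the remainder of the paper addresses.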

It is known that many fast functional approximations or information criterion techniques do not adequately approximate the underlying posterior distribution of the model order. Furthermore, Monte Carlo based estimators can provide approximate distributions of the model order, but typically require excessive computation time.

Our main contribution is to propose a new functional approximation technique to infer the posterior distribution of the model order, $p(K \mid y)$, where $K$ and $y$ denote the model order and the observations, respectively. In particular, this paper demonstrates the applicability of the proposed algorithm by addressing the problem of finding the number of neighbours k for probabilistic k-nearest neighbour (PKNN) classification. In addition, we designed a new symmetrized neighbouring structure for the KNN classifier in order to conduct a fair comparison. From an application point of view, we also classified several benchmark datasets and a few real experimental datasets using the proposed algorithms.

In addition to model selection, we also consider improvements of the KNN approach itself for the purpose of a fair comparison. Although conventional KNN based on Euclidean distance is widely used in many application domains, it is not a correct model in the sense that it does not guarantee the symmetry of the neighbouring structure.

It is important to state that PKNN formally defines a Markov random field over the joint distribution of the class labels. In turn this yields a complication from an inferential point of view, since it is well understood that the Markov random field corresponding to the likelihood of the class labels involves an intractable normalising constant, sometimes called the partition function in statistical physics, rendering exact calculation of the likelihood function almost always impossible. Inference for such complicated likelihood functions is an active field of research. In the context of PKNN, [1] and [3] use the pseudo-likelihood function [8] as an approximation to the true likelihood, while [2] and [4] consider improvements to the pseudo-likelihood by using a Monte Carlo auxiliary variable technique, the exchange algorithm [9], which targets the posterior distribution involving the true intractable likelihood function. Bayesian model selection is generally a computationally demanding exercise, particularly in the current context, due to the intractability of the likelihood function. For this reason we use a pseudo-likelihood approximation throughout this paper, although our efforts are focused on efficient means to improve upon this aspect using composite likelihood approximations.

This paper is organised as follows. Section 3 gives the background of the statistical approaches used in this paper, covering two main techniques: k-nearest neighbour (KNN) classification and the Integrated Nested Laplace Approximation (INLA). As part of the extended literature review, probabilistic kNN (PKNN) is explained in detail. The proposed algorithm is introduced in Section 4. In that section, we introduce a generic algorithm to reconstruct and approximate the underlying model order posterior $p(K \mid y)$ and to efficiently search for the optimal model order $K^*$, and then show how to apply the generic algorithm to PKNN. In Section 5, PKNN adopting the proposed algorithm is applied to several real datasets. Finally, we conclude the paper with some discussion in Sections 6 and 7.



2. Related Work 

One of the main aims of this paper is to explore nearest neighbour classification from a model selection perspective. Some popular model selection approaches in the literature include the following. Grenander et al. [11, 12] proposed a model selection algorithm based on jump-diffusion dynamics, with the essential feature that at random times the process jumps between parameter spaces in different models and different dimensions. Similarly, Markov birth-death processes and point processes can be considered. One of the most popular approaches to infer the posterior distribution and to explore model uncertainty is Reversible Jump Markov Chain Monte Carlo, developed by Richardson and Green [13]. The composite model approach of Carlin and Chib [14] is a further approach. The relationships between the choice of pseudo-prior in the case of Carlin and Chib's product composite model and the choice of proposal densities in the case of reversible jump are discussed by Godsill [15].

In addition, there are many similarities with the clustering domain. For instance, many clustering algorithms, such as K-means, the Gaussian Mixture Model (GMM) and spectral clustering, also face the challenging problem of inferring the number of clusters K, similar to the estimation of the number of neighbours K in (P)KNN.



3. Statistical Background 



3.1. k-Nearest Neighbour (kNN) model 

In pattern recognition, the k-nearest neighbour algorithm (kNN) is one of the most well-known and useful non-parametric methods for classifying and clustering objects based on classified features which are close, in some sense, in the feature space. The kNN is designed with the concept that the label or class of an object is determined by a majority vote of its neighbours. However, along with such a simple implementation, the kNN has a sensitivity problem arising from locality, which stems from two difficult problems: estimating the decision boundary to determine the boundary complexity, and choosing the number of neighbours that vote. In order to address this problem, adaptive kNN has been proposed to calculate the number of neighbours and the boundary efficiently and effectively [16, 17, 18, 19]. In addition, the probabilistic kNN (PKNN) model, which is more robust than the conventional kNN, has been introduced and developed using Markov chain Monte Carlo to estimate the number of neighbours [1, 3]. In this paper, we use the PKNN model since it provides a proper likelihood term given a particular model with k neighbours.
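The majority-vote rule described above can be sketched in a few lines of Python. This is a minimal illustration assuming Euclidean distance and a tiny made-up training set; the function name `knn_predict` and the data are hypothetical and not taken from the paper.

```python
import math
from collections import Counter

def knn_predict(train_y, train_z, query, k):
    """Return the most common label among the k training points nearest to query."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_y)), key=lambda i: math.dist(train_y[i], query))
    votes = Counter(train_z[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-class training set: feature vectors and their labels.
train_y = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_z = [1, 1, 1, 2, 2, 2]

print(knn_predict(train_y, train_z, (0.5, 0.5), k=3))
print(knn_predict(train_y, train_z, (5.2, 5.1), k=3))
```

The rule returns a label only; it attaches no probability to the decision, which is what motivates the probabilistic model used in the rest of the paper.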



Figure 1: Topological Explanation of PKNN. Panels: (a) Given data; (b) Asymmetric PKNN (K=2); (c) Symmetrised Boltzmann Model for PKNN.



3.1.1. An asymmetric Pseudo-likelihood of PKNN 

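Let $\{(z_1, y_1), (z_2, y_2), \ldots, (z_N, y_N)\}$ denote the training data, where each $z_i \in \{1, 2, \ldots, C\}$ is a class label and each $y_i \in \mathbb{R}^d$ is a $d$-dimensional feature vector. Then, the pseudo-likelihood of the probabilistic kNN (PKNN) proposed by [1] can be formed as

$$ p(z \mid y, \beta, K) \approx \prod_{i=1}^{N} \frac{\exp\left\{ \frac{\beta}{K} \sum_{j \in ne(i)} \delta_{z_i, z_j} \right\}}{\sum_{c \in \mathcal{C}} \exp\left\{ \frac{\beta}{K} \sum_{j \in ne(i)} \delta_{c, z_j} \right\}} \tag{1} $$

where $\beta > 0$ is the unknown scaling value, $\mathcal{C}$ is the set of classes, $K$ denotes the number of neighbours, and $\delta_{a,b} = 1$ if $a = b$ and $0$ otherwise. In this equation, $ne(\cdot)$ represents the set of neighbours.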
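To make the pseudo-likelihood concrete, the following Python sketch evaluates its logarithm for a small made-up dataset. It assumes Euclidean distance for the neighbour sets and the $\beta/K$ scaling described in the text; the helper names and the data are hypothetical.

```python
import math

def neighbours(y, i, k):
    """Indices of the k points nearest to y[i] (excluding i itself)."""
    others = [j for j in range(len(y)) if j != i]
    others.sort(key=lambda j: math.dist(y[i], y[j]))
    return others[:k]

def log_pseudo_likelihood(z, y, beta, k, classes):
    """Sum over i of log p(z_i | neighbours of i), using the beta/K scaling."""
    total = 0.0
    for i in range(len(y)):
        ne = neighbours(y, i, k)

        def score(c):
            # (beta / K) times the number of neighbours carrying label c.
            return (beta / k) * sum(1 for j in ne if z[j] == c)

        log_norm = math.log(sum(math.exp(score(c)) for c in classes))
        total += score(z[i]) - log_norm
    return total

# Hypothetical data: two well-separated classes.
y = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
z = [1, 1, 1, 2, 2, 2]

lp0 = log_pseudo_likelihood(z, y, beta=0.0, k=2, classes=(1, 2))
lp2 = log_pseudo_likelihood(z, y, beta=2.0, k=2, classes=(1, 2))
print(lp0, lp2)
```

With $\beta = 0$ every class is equally likely for every point, so the value reduces to $-N \log 2$ here; a positive $\beta$ rewards label agreement with the neighbours and raises the pseudo-likelihood on this cleanly separated toy data.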

Suppose that we have four data points as shown in Fig. 1(a). Given $K = 2$, the conventional PKNN yields the network structure in Fig. 1(b). In this subgraph, arrows point to the neighbours. As we can see in Fig. 1(b), some pairs of data points (nodes) are bidirectional but others are unidirectional, resulting in an asymmetric structure. Unfortunately, this asymmetry does not satisfy the Markov random field assumption which is implicitly applied in Eq. (1).
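The asymmetry is easy to reproduce numerically. The sketch below builds the $K = 2$ neighbour sets for four hypothetical one-dimensional points (not the points of Fig. 1, whose coordinates are not given) and lists the edges that exist in one direction only.

```python
import math

def knn_sets(y, k):
    """Map each index i to the set of its k nearest neighbours ne(i)."""
    ne = {}
    for i in range(len(y)):
        others = sorted((j for j in range(len(y)) if j != i),
                        key=lambda j: math.dist(y[i], y[j]))
        ne[i] = set(others[:k])
    return ne

def one_way_edges(ne):
    """Pairs (i, j) with j in ne(i) but i not in ne(j)."""
    return sorted((i, j) for i in ne for j in ne[i] if i not in ne[j])

# Hypothetical data: three nearby points and one outlier.
y = [(0.0,), (1.0,), (2.0,), (10.0,)]
ne = knn_sets(y, k=2)
print(ne)
print(one_way_edges(ne))
```

The outlier (index 3) selects points 1 and 2 as neighbours, but neither of them selects it back, so the neighbour relation is not symmetric.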

3.1.2. A symmetrised Boltzmann modelling for pseudo-likelihood of PKNN 
Since the pseudo-likelihood of the conventional probabilistic kNN is not symmetrised, an approximate symmetrised model has been proposed for PKNN [20] as

$$ p(z \mid y, \beta, K) \approx \prod_{i=1}^{N} \frac{\exp\left\{ \frac{\beta}{K} \cdot \frac{1}{2} \left( \sum_{j \in ne(i)} \delta_{z_i, z_j} + \sum_{j : i \in ne(j)} \delta_{z_i, z_j} \right) \right\}}{\sum_{c \in \mathcal{C}} \exp\left\{ \frac{\beta}{K} \cdot \frac{1}{2} \left( \sum_{j \in ne(i)} \delta_{c, z_j} + \sum_{j : i \in ne(j)} \delta_{c, z_j} \right) \right\}} \tag{2} $$

The Boltzmann modelling of PKNN resolves the asymmetry problem which arises from the conventional PKNN of Eq. (1). However, the Boltzmann modelling reconstructs the symmetrised network by averaging the asymmetric effects from the principal structure of PKNN, as shown in Fig. 1(c). This gives different interaction rates among the edges. In the subgraph, two edges have a value of one half and all others have a value of one, so this difference may again yield an inaccurate Markov random field model.

3.2. Estimation of PKNN by Markov chain Monte Carlo (MCMC) - a conventional way

The most popular approach to estimating the parameters of PKNN is Markov chain Monte Carlo (MCMC). In this paper, PKNN via MCMC is also used for performance comparison. In particular, there are two different versions of MCMC.

The first approach is to infer the unknown model parameters ($\beta$ and $K$) in the training step via MCMC. Afterwards, given these estimated values, we can classify the new data from the testing set straightforwardly using the conditional posterior $p(z' \mid y, z, y', \beta, K)$. Suppose that we need to reconstruct the target posterior $p(\beta, K \mid z, y)$ given the observations $z$ and $y$, which form the training data. The standard MCMC approach uses a Metropolis-Hastings (MH) algorithm, so that each unknown parameter is updated according to an acceptance probability

$$ A = \min\left\{ 1, \frac{p(z \mid y, \tilde{\beta}, \tilde{K})\, p(\tilde{\beta})\, p(\tilde{K})\, q(\beta, K)}{p(z \mid y, \beta, K)\, p(\beta)\, p(K)\, q(\tilde{\beta}, \tilde{K})} \right\} $$

where $\tilde{\beta}$ and $\tilde{K}$ denote the proposed new parameters. In the training step, we estimate $\beta$ and $K$ from the above MCMC simulation. Afterwards, we simply classify the testing datasets given $\beta$ and $K$. That is, given a testing set we can estimate the classes by

$$ z'^{*} = \arg\max_{z'} p(z' \mid y, z, y', \beta, K) $$

for a new test datum $y'$ and its unknown label $z'$. However, since the uncertainty of the model parameters is ignored in the testing step, the first approach with two separate steps (training and testing) is less preferred from a statistical point of view, although it is often used in practice. Unlike the first approach, the second approach jointly estimates the hidden model parameters to incorporate this uncertainty while classifying the testing datasets. In the second approach we reconstruct not the conditional distribution $p(z' \mid y, z, y', \beta, K)$ but a marginalized distribution $p(z' \mid y, z, y')$ by jointly estimating the parameters. In this case, the target density is not $p(\beta, K \mid z, y)$ but $p(\beta, K, z' \mid z, y, y')$. Then each unknown parameter from the marginalized density is updated according to the modified acceptance probability

$$ A = \min\left\{ 1, \frac{p(\tilde{z}', z \mid y, y', \tilde{\beta}, \tilde{K})\, p(\tilde{\beta})\, p(\tilde{K})\, q(z', \beta, K)}{p(z', z \mid y, y', \beta, K)\, p(\beta)\, p(K)\, q(\tilde{z}', \tilde{\beta}, \tilde{K})} \right\}. $$

In this paper, we use the second approach to infer the parameters and classify the data in the MCMC simulation used for comparison, since the joint estimation used to obtain the marginalized distribution considers the uncertainty even in the classification of the new dataset. We simply design $q(\tilde{z}', \tilde{\beta}, \tilde{K}) = q(\tilde{z}')\, q(\tilde{\beta})\, q(\tilde{K})$, and each proposal distribution is defined by

$$ \begin{aligned} q(\tilde{z}') &= p(\tilde{z}' \mid z, y, y', \tilde{\beta}, \tilde{K}) = \frac{p(\tilde{z}', z \mid y, y', \tilde{\beta}, \tilde{K})}{\sum_{c \in \mathcal{C}} p(z' = c, z \mid y, y', \tilde{\beta}, \tilde{K})} \\ q(\tilde{\beta}) &= \mathcal{N}(\tilde{\beta}; \beta, 0.1) \\ q(\tilde{K}) &= p(\tilde{K}) \end{aligned} \tag{4} $$

where we set $\beta_a = 2$ and $\beta_b = 10$ for the Gamma distribution. Given this particular setting of the proposal distribution, we obtain the simplified acceptance probability

$$ A = \min\left\{ 1, \frac{\sum_{s \in \mathcal{C}} p(z' = s, z \mid y, y', \tilde{\beta}, \tilde{K})\, p(\tilde{\beta})\, q(\beta)}{\sum_{s \in \mathcal{C}} p(z' = s, z \mid y, y', \beta, K)\, p(\beta)\, q(\tilde{\beta})} \right\} \tag{5} $$
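A Metropolis-Hastings update of this kind can be sketched as follows. This is a simplified illustration under stated assumptions, not the sampler used in the experiments: the likelihood is a made-up toy function rather than the PKNN pseudo-likelihood, the update of the test label $z'$ is omitted, the prior on $K$ is taken to be uniform on $\{1, \ldots, K_{\max}\}$ and used as its own proposal, and the Gaussian random-walk step size is a free parameter. With a symmetric proposal for $\beta$ and a prior proposal for $K$, the proposal terms cancel and the acceptance ratio reduces to likelihood times prior on $\beta$.

```python
import math
import random

def mh_step(beta, k, log_lik, log_prior_beta, k_max, rng, step_sd):
    """One joint Metropolis-Hastings update of (beta, K)."""
    beta_new = rng.gauss(beta, step_sd)   # symmetric random-walk proposal for beta
    k_new = rng.randint(1, k_max)         # proposal for K drawn from a uniform prior
    if beta_new <= 0.0:                   # outside the support beta > 0: reject
        return beta, k
    log_a = (log_lik(beta_new, k_new) + log_prior_beta(beta_new)
             - log_lik(beta, k) - log_prior_beta(beta))
    if rng.random() < math.exp(min(0.0, log_a)):
        return beta_new, k_new
    return beta, k

def toy_log_lik(beta, k):
    """Made-up log-likelihood peaked at beta = 2, K = 3 (illustration only)."""
    return -(beta - 2.0) ** 2 - 0.5 * (k - 3) ** 2

def flat_log_prior(beta):
    """Flat prior on beta > 0, up to an additive constant."""
    return 0.0

rng = random.Random(0)
state = (1.0, 1)
chain = []
for _ in range(2000):
    state = mh_step(*state, toy_log_lik, flat_log_prior, 10, rng, 0.3)
    chain.append(state)
print(chain[-1])
```

Each iteration requires a fresh likelihood evaluation, and many iterations are needed before the chain represents the posterior, which is the computational cost the proposed approach aims to avoid.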



3.3. Integrated Nested Laplace Approximation (INLA)

Suppose that we have a set of hidden variables $f$ and a set of observations $y$. MCMC can of course be used to infer the marginal density $p(f \mid y) = \int p(f, \theta \mid y)\, d\theta$, where $\theta$ is a set of control parameters. In order to build the target density efficiently, we apply a remarkably fast and accurate functional approximation based on the Integrated Nested Laplace Approximation (INLA) developed by [21]. This algorithm approximates the marginal posterior $p(f \mid y)$ by

$$ p(f \mid y) = \int p(f \mid y, \theta)\, p(\theta \mid y)\, d\theta \approx \int \tilde{p}(f \mid y, \theta)\, \tilde{p}(\theta \mid y)\, d\theta \approx \sum_{i} \tilde{p}(f \mid y, \theta_i)\, \tilde{p}(\theta_i \mid y)\, \Delta_{\theta_i} \tag{6} $$

where

$$ \tilde{p}(\theta \mid y) \propto \left. \frac{p(f, y, \theta)}{p_F(f \mid y, \theta)} \right|_{f = f^*(\theta)} = \left. \frac{p(y \mid f, \theta)\, p(f \mid \theta)\, p(\theta)}{p_F(f \mid y, \theta)} \right|_{f = f^*(\theta)} \tag{7} $$

Here, $F$ denotes a simple functional approximation close to $p(f \mid y, \theta)$, such as a Gaussian approximation, and $f^*(\theta)$ is a value of the functional approximation. For the simple Gaussian approximation case, the proper choice of $f^*(\theta)$ is the mode of the Gaussian approximation $p_G(f \mid y, \theta)$. Given the log of the posterior, we can calculate the mode $\theta^*$ and its Hessian matrix $H_{\theta^*}$ via quasi-Newton style optimization of $\theta^* = \arg\max_{\theta} \log \tilde{p}(\theta \mid y)$. Finally, we do a grid search from the mode in all directions until $\log \tilde{p}(\theta^* \mid y) - \log \tilde{p}(\theta \mid y) > \varphi$ for a given threshold $\varphi$.
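The mode search and the grid exploration described above can be sketched for a one-dimensional $\theta$. The sketch assumes a made-up Gaussian-shaped log-posterior, uses a plain Newton iteration with finite-difference derivatives in place of a quasi-Newton routine, and uses hypothetical values for the grid step and the threshold $\varphi$.

```python
import math

def find_mode(log_post, start, h=1e-4, iters=50):
    """Newton iteration with finite-difference derivatives; returns (mode, curvature)."""
    theta, curv = start, -1.0
    for _ in range(iters):
        grad = (log_post(theta + h) - log_post(theta - h)) / (2.0 * h)
        curv = (log_post(theta + h) - 2.0 * log_post(theta) + log_post(theta - h)) / h ** 2
        if curv >= 0.0:  # not locally concave: stop rather than step uphill blindly
            break
        theta -= grad / curv
    return theta, curv

def explore_grid(log_post, mode, step, threshold, max_steps=1000):
    """Walk away from the mode in both directions until the log-density drop exceeds threshold."""
    grid = [mode]
    for direction in (-1.0, 1.0):
        theta = mode + direction * step
        for _ in range(max_steps):
            if log_post(mode) - log_post(theta) > threshold:
                break
            grid.append(theta)
            theta += direction * step
    return sorted(grid)

def grid_density(log_post, grid, step):
    """Normalise exp(log_post) on the grid so that sum(density) * step == 1."""
    peak = max(log_post(t) for t in grid)
    weights = [math.exp(log_post(t) - peak) for t in grid]
    total = sum(weights) * step
    return [w / total for w in weights]

def toy_log_post(theta):
    """Made-up unnormalised log-posterior: Gaussian shape centred at 1 with unit variance."""
    return -0.5 * (theta - 1.0) ** 2

mode, curv = find_mode(toy_log_post, start=0.0)
laplace_sd = math.sqrt(-1.0 / curv)   # standard deviation of the Gaussian approximation
grid = explore_grid(toy_log_post, mode, step=0.5, threshold=2.5)
density = grid_density(toy_log_post, grid, step=0.5)
print(mode, laplace_sd, len(grid))
```

The grid points and their normalised weights play the role of $\theta_i$ and $\tilde{p}(\theta_i \mid y)\, \Delta_{\theta_i}$ in the finite sum of Eq. (6).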



4. Proposed Approach 

Our proposed algorithm estimates the underlying densities for the number of neighbours of probabilistic kNN classification by using Eq. (7). To distinguish it from other model selection approaches, we term this approach KOREA, which is an acronym for "K-ORder Estimation Algorithm" in a Bayesian framework.

4.1. Obtaining the optimal number of neighbours K* 

Let $y$ denote a set of observations and let $f_K$ be a set of the model parameters given a model order $K$. The first step of KOREA is to estimate the optimal number of neighbours, $K^*$:

$$ K^* = \arg\max_{K} p(K \mid y). \tag{8} $$

According to Eq. (7), we can obtain an approximated marginal posterior distribution by

$$ \tilde{p}(K \mid y) \propto \left. \frac{p(f_K, y, K)}{p_F(f_K \mid y, K)} \right|_{f_K = f_K^*(K)} \tag{9} $$

This equation has the property that $K$ is an integer variable, while $\theta$ in Eq. (7) is in general a vector of continuous variables. By ignoring this difference, we can still use the quasi-Newton method to efficiently obtain the optimal $K^*$. Alternatively, we can simply evaluate all candidates between 1 and $K_{\max}$ if $K_{\max}$ is not too large. Otherwise, we may still use the quasi-Newton style algorithm with a rounding operator which transforms a real value to an integer for $K$.
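When $K_{\max}$ is small, enumerating all candidates is straightforward. The following sketch normalises an unnormalised log-posterior over $K = 1, \ldots, K_{\max}$ and returns the maximiser $K^*$; the bimodal toy function stands in for the right-hand side of Eq. (9) and is purely hypothetical.

```python
import math

def posterior_over_k(log_unnorm, k_max):
    """Normalise exp(log_unnorm(K)) over K = 1..k_max and return (probabilities, K*)."""
    logs = [log_unnorm(k) for k in range(1, k_max + 1)]
    peak = max(logs)                                  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logs]
    total = sum(weights)
    probs = [w / total for w in weights]
    k_star = 1 + max(range(k_max), key=lambda i: probs[i])
    return probs, k_star

def toy_log_unnorm(k):
    """Made-up bimodal unnormalised log-posterior with modes near K = 4 and K = 12."""
    return math.log(0.6 * math.exp(-0.5 * (k - 4) ** 2)
                    + 0.4 * math.exp(-0.5 * (k - 12) ** 2 / 4.0))

probs, k_star = posterior_over_k(toy_log_unnorm, k_max=20)
print(k_star, round(probs[k_star - 1], 3))
```

Because every candidate is evaluated, this route cannot be trapped in a local mode, whereas a rounded quasi-Newton search started near the second mode of such a function could be.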

4.2. Bayesian Model Selection for PKNN classification 

In general, one of the most significant problems in classification is to infer the joint posterior distribution of $L$ different hidden classes for $L$ different observations such that $z'^{*}_{1:L} = \arg\max_{z'_{1:L}} p(z'_{1:L} \mid y, z, y'_{1:L})$. However, jointly inferring the hidden variables is not straightforward, so we make the assumption that the hidden class $z'_i$ of the $i$-th observation is independent of that of the $j$-th observation given the $i$-th observation $y'_i$, where $i \neq j$. We then have the following simpler form (similar to Naive Bayes):

$$ p(z'_{1:L} \mid y, z, y'_{1:L}) = \prod_{i=1}^{L} p(z'_i \mid y, z, y'_i) \tag{10} $$

where $p(z'_i \mid y, z, y'_i)$ is estimated by Eq. (11).

4.2.1. PKNN via KOREA 

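In the probabilistic kNN model (PKNN), let us define the new dataset with $L$ data by $y'_{1:L}$, which is not labelled yet. The unknown labels are denoted by $z'_{1:L}$. Here we use $y'_i$ and $z'_i$ for the $i$-th new observation and its hidden label. That is, we have a hidden variable $f_K = z'_i$ of interest given $z = z_{1:N}$, $y = y_{1:N}$ and $y'_i$, and we write $\mathcal{Y} = (z, y, y'_i)$. The target posterior is obtained in a similar form to Eq. (6) as

$$ \begin{aligned} p(z'_i \mid y, z, y'_i) = p(z'_i \mid \mathcal{Y}) &= \int\!\!\int p(z'_i, \beta, K \mid \mathcal{Y})\, d\beta\, dK \\ &= \int\!\!\int p(z'_i \mid \beta, \mathcal{Y}, K)\, p(\beta \mid \mathcal{Y}, K)\, p(K \mid \mathcal{Y})\, d\beta\, dK \\ &\approx \sum_{j=1}^{K_{\max}} \sum_{\beta^{(m)}} \left[ p(z'_i \mid \beta^{(m)}, \mathcal{Y}, K = j)\, \tilde{p}(\beta^{(m)} \mid \mathcal{Y}, K = j)\, \tilde{p}(K = j \mid \mathcal{Y})\, \Delta_{\beta^{(m)}} \right] \\ &= \sum_{\beta^{(m)}} \sum_{j=1}^{K_{\max}} p(z'_i \mid \beta^{(m)}, \mathcal{Y}, K = j)\, w_j^{(m)} \end{aligned} \tag{11} $$

where

$$ w_j^{(m)} = \tilde{p}(\beta^{(m)} \mid \mathcal{Y}, K = j)\, \tilde{p}(K = j \mid \mathcal{Y})\, \Delta_{\beta^{(m)}} \tag{12} $$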
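The double sum over the $\beta$ grid and the model orders is a weighted average of conditional class probabilities. A minimal sketch, assuming the conditional probabilities and the weights have already been computed (the numbers below are made up), is:

```python
def model_average(cond_probs, weights):
    """Average class probabilities over a beta grid (index m) and model orders (index j).

    cond_probs[m][j][c] is p(z' = c | beta^(m), K = j); weights[m][j] is the
    weight attached to the pair (beta^(m), K = j). Weights are renormalised here.
    """
    total_weight = sum(sum(row) for row in weights)
    n_classes = len(cond_probs[0][0])
    averaged = [0.0] * n_classes
    for m, row in enumerate(weights):
        for j, w in enumerate(row):
            for c in range(n_classes):
                averaged[c] += (w / total_weight) * cond_probs[m][j][c]
    return averaged

# Hypothetical inputs: 2 grid points for beta, 2 model orders, 2 classes.
cond_probs = [[[0.9, 0.1], [0.6, 0.4]],
              [[0.8, 0.2], [0.5, 0.5]]]
weights = [[0.1, 0.3],
           [0.2, 0.4]]

averaged = model_average(cond_probs, weights)
print([round(p, 2) for p in averaged])
```

The result is a proper distribution over classes whenever each conditional distribution is, since the renormalised weights sum to one.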

Now we need to know three distributions in the above equation.

1. $p(z'_i \mid \beta^{(m)}, \mathcal{Y}, K = j)$: conditional likelihood

2. $p(\beta^{(m)} \mid \mathcal{Y}, K = j)$: posterior of $\beta$

3. $p(K \mid \mathcal{Y})$: posterior of $K$

The first of the three distributions above is the conditional distribution and it is defined by

$$ p(z'_i \mid \beta^{(m)}, \mathcal{Y}, K = j) = \frac{p(z, z'_i \mid \beta^{(m)}, y, y'_i, K = j)}{\sum_{c \in \mathcal{C}} p(z, z'_i = c \mid \beta^{(m)}, y, y'_i, K = j)} \tag{13} $$

This is a likelihood function given the neighbouring structure. That is, $p(z, z'_i \mid \beta^{(m)}, y, y'_i, K = j)$ explains the fitness between the assumed/given labels $(z, z'_i)$ and the given full data $(y, y'_i)$.

The second distribution is $p(\beta^{(m)} \mid \mathcal{Y}, K = j)$, but we defer its estimation since it is obtained automatically when we estimate the last distribution $p(K \mid \mathcal{Y})$. Therefore, we infer the last distribution first. It is the marginal posterior of $K$, and using a similar approach to INLA it is defined by

$$ \begin{aligned} \tilde{p}(K \mid \mathcal{Y}) &\propto \left. \frac{p(z, \beta, K \mid y, y'_i)}{p_G(\beta \mid z, y, y'_i, K)} \right|_{\beta = \beta^*(K)} \\ &= \left. \frac{\sum_{c \in \mathcal{C}} p(z'_i = c, z, \beta, K \mid y, y'_i)}{p_G(\beta \mid z, y, y'_i, K)} \right|_{\beta = \beta^*(K)} \\ &= \left. \frac{p(\beta)\, p(K) \sum_{c \in \mathcal{C}} p(z'_i = c, z \mid \beta, K, y, y'_i)}{p_G(\beta \mid z, y, y'_i, K)} \right|_{\beta = \beta^*(K)} \end{aligned} \tag{14} $$

As we can see, the denominator is the approximation of the second distribution of interest, so we can reuse it, i.e. $\tilde{p}(\beta \mid \mathcal{Y}, K) = p_G(\beta \mid z, y, y'_i, K)$, which is a Gaussian approximation of $p(\beta \mid \mathcal{Y}, K) \propto p(z \mid y, y'_i, \beta, K)\, p(\beta) = \sum_{c \in \mathcal{C}} p(z'_i = c, z \mid y, y'_i, \beta, K)\, p(\beta)$.

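We can also easily obtain the marginal posterior of $\beta$, which is $p(\beta \mid \mathcal{Y})$. Since the marginal posterior is approximated by $p(\beta \mid \mathcal{Y}) \approx \tilde{p}(\beta \mid \mathcal{Y}) = \sum_{j=1}^{K_{\max}} \tilde{p}(\beta \mid \mathcal{Y}, K = j)\, \tilde{p}(K = j \mid \mathcal{Y})$, we can simply reconstruct the distribution by reusing the previously estimated distributions. When we have $\mu_\beta^{(j)} = E(\beta \mid \mathcal{Y}, K = j)$ and $\sigma_\beta^{(j)2} = V(\beta \mid \mathcal{Y}, K = j)$, then we have

$$ \mu_\beta = \sum_{j=1}^{K_{\max}} \alpha_j\, \mu_\beta^{(j)} \quad \text{and} \quad \sigma_\beta^2 = \sum_{j=1}^{K_{\max}} \alpha_j \left[ \sigma_\beta^{(j)2} + \left( \mu_\beta^{(j)} - \mu_\beta \right)^2 \right] \tag{15} $$

where $\alpha_j = \tilde{p}(K = j \mid \mathcal{Y})$.

Finally, we can obtain the target distribution of interest $p(z'_i \mid y, z, y'_i)$ from the three distributions. Since we can now estimate the target distribution as a mixture distribution, we can also obtain its expectation and variance as follows:

$$ \mu_{z'_i} = \sum_{\beta^{(m)}} \sum_{j=1}^{K_{\max}} w_j^{(m)}\, \mu_j^{(m)} \quad \text{and} \quad \sigma_{z'_i}^2 = \sum_{\beta^{(m)}} \sum_{j=1}^{K_{\max}} w_j^{(m)} \left[ \sigma_j^{(m)2} + \left( \mu_j^{(m)} - \mu_{z'_i} \right)^2 \right] \tag{16} $$

where $\mu_j^{(m)} = E(z'_i \mid \mathcal{Y}, \beta^{(m)}, K = j)$, $\sigma_j^{(m)2} = V(z'_i \mid \mathcal{Y}, \beta^{(m)}, K = j)$ and the weights $w_j^{(m)}$ are normalised to sum to one. Here $p(\beta) = \mathcal{IG}(\beta; a, b)$, where $\mathcal{IG}(\cdot; a, b)$ represents the inverse Gamma distribution with hyper-parameters $a$ and $b$. In this paper, we set $a = 2$ and $b = 10$, yielding an almost flat prior.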
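Both Eq. (15) and Eq. (16) compute the mean and variance of a mixture from the means and variances of its components. A minimal sketch with made-up component values, assuming the weights already sum to one, is:

```python
def mixture_moments(weights, means, variances):
    """Mean and variance of a mixture with normalised weights.

    The variance adds the within-component variance and the spread of the
    component means around the overall mean.
    """
    mu = sum(w * m for w, m in zip(weights, means))
    var = sum(w * (v + (m - mu) ** 2) for w, m, v in zip(weights, means, variances))
    return mu, var

# Hypothetical two-component example.
mu, var = mixture_moments(weights=[0.5, 0.5], means=[1.0, 3.0], variances=[1.0, 1.0])
print(mu, var)
```

The second term in the variance is what makes the averaged posterior wider than any single component when the model orders disagree about the location of $\beta$.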

4.3. Additional Neighbouring Rules 

4.3.1. A Boltzmann modelling with equal weights 

In the conventional Boltzmann modelling for the neighbouring structure, the interaction rate $\beta$ is divided by a fixed $K$, as shown in Eq. (2). This results in each neighbour having its own different weight. Therefore, we need to apply an equal weight to the neighbours by varying $K$ for the different neighbouring structures. In order to build this strategy, we adopt three sequential steps: (i) obtain a neighbour structure in the same way as conventional Boltzmann modelling; (ii) modify the structure by transforming it from a directed graph to an undirected graph, i.e. if $j \in ne(i)$ but $i \notin ne(j)$ then we add $i$ into $ne(j)$ for $i \neq j$; and (iii) apply the pseudo-likelihood for the likelihood. In this paper, we name this modelling Boltzmann(2) modelling.
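Step (ii), the conversion of the directed neighbour structure into an undirected one, can be sketched as follows. The input neighbour sets are a made-up example with four nodes, and the function name `symmetrise` is hypothetical.

```python
def symmetrise(ne):
    """Return neighbour sets in which j in ne(i) implies i in ne(j)."""
    sym = {i: set(nb) for i, nb in ne.items()}
    for i, nb in ne.items():
        for j in nb:
            if j != i:
                sym[j].add(i)   # add the missing reverse edge
    return sym

# Hypothetical directed K = 2 structure: node 3 points to 1 and 2 but is not pointed to.
ne = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: {1, 2}}
sym = symmetrise(ne)
print(sym)
```

After this step the number of neighbours differs from node to node (nodes 1 and 2 now have three), which is why the text speaks of varying $K$ across the neighbouring structure.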



5. Simulation Results 

The performance of our algorithm is tested with a collection of benchmark datasets. All of the datasets (test and training) used in this paper can be found at http://mathsci.ucd.ie/~nial/dnn/. The six well-known benchmark datasets are presented in Table 1. We test the performance using 4-fold cross validation for a fair comparison with all approaches, although our proposed approach does not require it due to its Bayesian nature.

Algorithm 1 PKNN classifier via KOREA

Require: $N$ observations $(y, z) = (y_{1:N}, z_{1:N})$, a new testing set with $L$ observations $y' = y'_{1:L}$, and a set of classes $\mathcal{C}$
 1: for $i = 1$ to $L$ do
 2:   Obtain a new observation $y'_i$ and set $\mathcal{Y} = (y, z, y'_i)$.
      - Calculate $\tilde{p}(K \mid \mathcal{Y})$.
 3:   for $j = 1$ to $K_{\max}$ do
 4:     Calculate the approximate conditional posterior $\tilde{p}(\beta \mid \mathcal{Y}, K = j) = p_G(\beta \mid \mathcal{Y}, K = j)$ by using a Gaussian approximation of $p(\beta \mid \mathcal{Y}, K = j) \propto p(z \mid y, y'_i, \beta, K = j)\, p(\beta)$.
 5:     Obtain $\mu_\beta^{(j)} = E(\beta \mid \mathcal{Y}, K = j)$ and $\sigma_\beta^{(j)2} = V(\beta \mid \mathcal{Y}, K = j)$.
 6:     Calculate an unnormalized posterior for $K = j$, $\hat{\alpha}_j = \tilde{p}(K = j \mid \mathcal{Y})$, from Eq. (14) evaluated at $\beta = \mu_\beta^{(j)}$, with denominator $p_G(\mu_\beta^{(j)} \mid \mathcal{Y}, K = j)$.
 7:   end for
 8:   Normalize the model order weights by $\alpha_s = \hat{\alpha}_s / \sum_{j=1}^{K_{\max}} \hat{\alpha}_j$ for all $s \in \{1, 2, \ldots, K_{\max}\}$.
 9:   Calculate the mean $\mu_\beta$ and variance $\sigma_\beta^2$ of the marginal posterior of $\beta$ from Eq. (15).
10:   $S_\beta = \{\beta \mid 0 < \beta = \mu_\beta \pm i \sigma_\beta < \beta_{\max} \text{ for } i = 1, 2, \ldots\}$.
11:   Calculate an unnormalized weight $\hat{w}_j^{(m)} = \tilde{p}(\beta^{(m)} \mid \mathcal{Y}, K = j)\, \alpha_j$ for $j = 1, 2, \ldots, K_{\max}$ and $m = 1, 2, \ldots, |S_\beta|$.
12:   Obtain $w_j^{(m)} = \hat{w}_j^{(m)} / \sum_{m'} \sum_{j'} \hat{w}_{j'}^{(m')}$ for all $j \in \{1, 2, \ldots, K_{\max}\}$ and all $m \in \{1, 2, \ldots, |S_\beta|\}$ from Eq. (12).
      - Calculate the solution of $p(z'_i \mid \mathcal{Y})$.
13:   for $m = 1$ to $|S_\beta|$ do
14:     for $j = 1$ to $K_{\max}$ do
15:       for $c \in \mathcal{C}$: get $r_{j,c}^{(m)} = p(z'_i = c, z \mid y, y'_i, K = j, \beta^{(m)})$.
16:       for $c \in \mathcal{C}$: get $\bar{r}_{j,c}^{(m)} = r_{j,c}^{(m)} / \sum_{c' \in \mathcal{C}} r_{j,c'}^{(m)}$.
17:     end for
18:   end for
19:   Calculate $p(z'_i = c \mid \mathcal{Y}) = \sum_{m=1}^{|S_\beta|} \sum_{j=1}^{K_{\max}} w_j^{(m)}\, \bar{r}_{j,c}^{(m)}$ for all $c \in \mathcal{C}$.
20:   Calculate the expectation and variance of $z'_i$ from Eq. (16).
21: end for
Table 1: Benchmark datasets: C (the number of classes), d (the dimension of the data), N_total (the total number of data)

Name of data   C   d      N_total
Crabs          4   5      200
Fglass         4   9      214
Meat           5   1050   231
Oliveoil       3   1050   65
Wine           3   13     178
Iris           3   4      150

Figure 2 shows the reconstructed densities for a testing datum. While the top subgraphs show the two-dimensional densities $p(\beta, K \mid y)$, the bottom sub-figures represent the one-dimensional densities $p(K \mid y)$ for all datasets. The graphs illustrate that the distribution is not unimodal but a complex multi-modal distribution. This also suggests that selecting an appropriate number of neighbours for PKNN is critical to obtain high accuracy.

Asymptotically, MCMC with a large number of iterations will converge and can therefore be used in principle to estimate the underlying posterior density. Thus, in order to validate our proposed algorithm, we can check whether the density reconstructed by KOREA is close to that estimated by MCMC with a very large number of iterations. The two subgraphs of Figure 3 visualize the similarity between the posterior densities of a testing datum of the wine dataset reconstructed by KOREA (red circle line) and by MCMC (blue cross line) with a small (top) and a large (bottom) number of samples. (For MCMC, we set the sample size to 100 for the small case and 10000 for the large case.) As we can see in the figures, our proposed algorithm KOREA closely approximates the MCMC algorithm with a large number of iterations, which is commonly regarded as the underlying reference or pseudo-ground truth density. In order to measure the similarity between the densities reconstructed by MCMC and KOREA, we use four different metrics, as shown in Figure 4: Root Mean Square Error (RMSE), Peak Signal-to-Noise Ratio (PSNR), Kullback Leibler Distance (KLD) and Structural SIMilarity (SSIM) [22]. As in the case of Figure 3, MCMC with a large sample size produces densities very close to those produced by our proposed KOREA algorithm. As the number of MCMC samples increases, RMSE and KLD decrease while PSNR and SSIM increase for all datasets.
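Three of these metrics are simple to compute for two discrete densities evaluated on the same grid. The sketch below is an illustration under stated assumptions rather than the evaluation code used for Figure 4: PSNR takes the peak of the reference density as the signal maximum, KLD adds a small constant inside the logarithm to guard against zeros, SSIM is omitted because it needs windowed statistics, and the two densities are made up.

```python
import math

def rmse(p, q):
    """Root mean square error between two densities on the same grid."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)) / len(p))

def kld(p, q, eps=1e-12):
    """Kullback-Leibler divergence sum p log(p / q), with eps guarding against zeros."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0.0)

def psnr(p, q):
    """Peak signal-to-noise ratio in dB, with max(p) as the peak (an assumption)."""
    err = rmse(p, q)
    if err == 0.0:
        return math.inf
    return 20.0 * math.log10(max(p) / err)

# Hypothetical densities over five values of K: a reference and a flat approximation.
reference = [0.1, 0.2, 0.4, 0.2, 0.1]
flat = [0.2, 0.2, 0.2, 0.2, 0.2]

print(rmse(reference, flat), kld(reference, flat), psnr(reference, flat))
```

Lower RMSE and KLD and higher PSNR indicate closer agreement, which matches the direction of the trends reported as the number of MCMC samples grows.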

Table 2 reports the performance of each algorithm, based on the F-measure, for four cases: kNN, PKNN, KOREA (average) and KOREA (optimal).

Figure 2: Posterior distribution $p(K, \beta \mid y)$ [top] and its marginalized posterior density $p(K \mid y)$ [bottom] via KOREA. Panels: (a) Crabs, (b) Fglass, (c) Meat, (d) Oliveoil, (e) Wine, (f) Iris.



Since MCMC produces results which are very close to those of KOREA, as shown in Figures 3 and 4, we do not present the MCMC results. KOREA (average) and KOREA (optimal) represent the mean (marginalized) estimate and the MAP estimate of KOREA, respectively. As we can see in the table, KOREA is superior to the other conventional approaches for all datasets. The results with the best performance are highlighted in bold in this table.

In addition, we compared the simulation times for each of the algorithms. Table 3 reports the execution time for all algorithms. Our proposed algorithm (PKNN with KOREA) is slower than conventional kNN and PKNN with fixed K, but it is much faster than the MCMC technique, which is regarded as one of the best approaches to infer the model parameters and the number of neighbours in a Bayesian framework. From the accuracy in Table 2 and the execution time in Table 3, we find that PKNN can be efficiently improved by using our proposed KOREA algorithm, and that this is a very practically useful technique compared to the conventional approaches including KNN, PKNN and MCMC.

Figure 3: Comparison between MCMC and KOREA for the wine dataset; N_MCMC denotes the number of MCMC iterations. Panels: (a) N_MCMC = 100, (b) N_MCMC = 10000.

Figure 4: Implicit similarity check between the reconstructed densities by MCMC and KOREA via four well-known metrics. Panels: RMSE, KLD, PSNR, SSIM.

6. Discussion 

Our proposed algorithm uses an approach similar to the idea of INLA by replacing the model parameters with the model order (the number of neighbours, k).

Table 2: Comparison of F-measures with varying neighbouring structures. The results with the best performance are written in bold.

Methods           Data       Asymmetric model   Symmetric Boltzmann   Boltzmann (2)
KNN               Crabs      0.72±0.08          0.74±0.08             0.74±0.08
                  Fglass     0.64±0.06          0.67±0.05             0.67±0.05
                  Meat       0.68±0.07          0.70±0.07             0.70±0.07
                  Oliveoil   0.74±0.12          0.71±0.10             0.71±0.10
                  Wine       0.97±0.01          0.97±0.01             0.97±0.01
                  Iris       0.57±0.10          0.55±0.10             0.55±0.10
PKNN              Crabs      0.75±0.09          0.75±0.09             0.75±0.08
                  Fglass     0.73±0.06          0.74±0.06             0.69±0.06
                  Meat       0.70±0.07          0.71±0.07             0.70±0.06
                  Oliveoil   0.72±0.11          0.73±0.11             0.70±0.11
                  Wine       0.98±0.01          0.98±0.01             0.98±0.02
                  Iris       0.57±0.12          0.57±0.12             0.53±0.10
KOREA (average)   Crabs      0.86±0.11          0.89±0.11             0.87±0.09
                  Fglass     0.76±0.09          0.77±0.07             [illegible]
                  Meat       0.68±0.12          0.75±0.06             0.71±0.06
                  Oliveoil   0.82±0.17          0.76±0.19             0.73±0.20
                  Wine       0.99±0.12          0.99±0.02             0.98±0.02
                  Iris       0.62±0.15          0.58±0.17             0.56±0.03
KOREA (optimal)   Crabs      0.86±0.13          0.89±0.11             0.87±0.09
                  Fglass     0.79±0.04          0.76±0.08             0.79±0.07
                  Meat       0.70±0.11          0.73±0.07             0.69±0.13
                  Oliveoil   0.80±0.17          0.78±0.17             0.76±0.19
                  Wine       0.99±0.02          0.99±0.02             0.98±0.02
                  Iris       0.57±0.17          0.56±0.19             0.48±0.16



This means that we can speed up the computation by embedding a (quasi-)Newton method for the Laplace approximation, as done in the original INLA, rather than grid sampling. However, as we can see in Fig. 2, the posterior is not unimodal, so if we use such a simple Laplace approximation we may find a local optimum rather than the global optimum for the maximal mode of the posterior. Therefore, instead of the (quasi-)Newton methods employed in the original INLA, we reconstructed the density with the relatively slower grid approach for the real datasets in the PKNN experiments of this paper. Of course, if the distribution is uni-modal, then we can use the quasi-Newton method to speed up the algorithm.

Table 3: Time comparison: the average of the execution times

Data       KNN    PKNN   MCMC (10000 runs)   KOREA
Crabs      0.10   0.46   168.76              9.77
Fglass     0.11   0.52   200.59              10.61
Meat       0.12   0.92   270.46              15.66
Oliveoil   0.02   0.13   34.58               [illegible]
Wine       0.08   0.30   129.03              [illegible]
Iris       0.07   0.26   95.47               [illegible]

7. Conclusion 

We proposed a model selection algorithm for probabilistic k-nearest neighbour (PKNN) classification which is based on functional approximation in a Bayesian framework. This algorithm has several advantages compared to other conventional model selection techniques. First of all, the proposed approach can quickly provide a proper distribution of the model order k, which is not given by other approaches, in contrast to time-consuming techniques like MCMC. In addition, since the proposed algorithm is based on a Bayesian scheme, we do not need to run cross validation, which is usually used for performance evaluation. The proposed algorithm can also inherit the power of the fast functional approximation of INLA. For instance, it can quickly find the optimal number of neighbours k and efficiently generate the grid samples by embedding a quasi-Newton method if the posterior is uni-modal. Lastly, the proposed approach can calculate the model average, which is the marginalized posterior $p(x \mid y) = \int_{M} p(x \mid y, M)\, p(M \mid y)\, dM$. We also remark that our algorithm is based on a pseudo-likelihood approximation of the likelihood and suggest that, although our algorithm has yielded good performance, further improvements may result from utilising more accurate approximations of the likelihood, albeit at the expense of computational run time.